The David De Gea Dilemma: Comparing Goalkeeper Greats Throughout History

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import plotly.graph_objects as go

from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, silhouette_samples

June 24th 2023

In [2]:
gks = ["ps","vds","cech","iker","buffon","neuer","alisson","ederson","courtois","ddg","costa","onana"]
full_name = ["Peter Schmeichel","Edwin van der Sar","Petr Cech","Iker Casillas","Gianluigi Buffon","Manuel Neuer","Alisson Becker","Ederson","Thibaut Courtois","David De Gea","Diogo Costa","Andre Onana"]

Penalty Kicks (+ shootouts)

For penalties, let's inspect some data and assess how De Gea compares to retired goalkeepers regarded as legends, not just at United but at other major European clubs, namely:
  1. Peter Schmeichel (Manchester United)
  2. Edwin van der Sar (Manchester United)
  3. Petr Cech (Chelsea)
  4. Iker Casillas (Real Madrid)
  5. Gianluigi Buffon (Juventus)

alongside these active players who have consistently performed at a high level:

  1. Manuel Neuer (Bayern Munich)
  2. Alisson Becker (Liverpool)
  3. Ederson (Manchester City)
  4. Thibaut Courtois (Real Madrid)

and finally two keepers who have been on the radar of United and their fans:

  1. Diogo Costa (Porto)
  2. Andre Onana (Inter Milan)
In [3]:
pk_save_rate = [3/37,11/60,17/85,23/100,39/124,22/76,14/33,7/54,14/66,14/74,11/34,7/33]

Data source: Transfermarkt (e.g. Onana)

In [4]:
fig = go.Figure(data=[go.Bar(x=gks, y=pk_save_rate,hovertext=full_name)])
# Customize aspect
fig.update_traces(marker_color='rgb(158,202,100)', marker_line_color='rgb(8,48,107)',
                  marker_line_width=1.5, opacity=0.6)
fig.update_layout(title_text='Penalty Kick Save Rates among top Keepers (entire career, all comps, excluding shootouts)')
fig.show()
While De Gea's penalty-saving record is not something he can be particularly proud of, I, as a United fan, find it quite funny that his save rate is higher than those of both van der Sar and Peter Schmeichel, two United legends.
Next, let's use KL divergence to compare these numbers. In short, KL divergence compares two probability distributions by taking the expectation, under the first distribution, of the log-ratio of their probabilities.
Here, let's assume that, based on historical data, penalty takers have an 85% chance of scoring and a 15% chance of missing. I couldn't find a consensus figure after going through several data sources, but the number seems to be somewhere north of 80 percent, so let's go with 85 percent to keep things simple. (Some data sources set aside a separate percentage for takers missing the goal completely, but let's combine that with the keeper saving the penalty, because it is my opinion that a miss of any kind can be attributed to the keeper. It's a mental game!)
To that end, our base distribution will be $p_{scored} = .85$ and $p_{saved}=.15$ (implying that the average keeper prevails against the shooter only 15 percent of the time), and we will compare each goalkeeper's individual penalty kick distribution against it.
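Written out for this two-outcome case, and assuming the natural log that `np.log` uses, the KL divergence from the baseline $p$ to a keeper's distribution $q$ is the following, where $s$ is shorthand introduced here for that keeper's save rate:

$$
D_{KL}(p \,\|\, q) = \sum_{i} p_i \ln\frac{p_i}{q_i} = 0.85 \ln\frac{0.85}{1-s} + 0.15 \ln\frac{0.15}{s}
$$

For example, De Gea's $s = 14/74 \approx 0.189$ gives a value of roughly 0.005, while a keeper with $s = 0.15$ exactly would score 0.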
In [5]:
p = [.85,.15]
In [6]:
def kl_divergence(p, q):
    # KL(p || q) = sum_i p_i * log(p_i / q_i), natural log
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(p * np.log(p / q))
In [7]:
gks_kld = []

for save_rate in pk_save_rate:
    q = [1-save_rate, save_rate]
    gks_kld.append(kl_divergence(p,q))

fig1 = go.Figure(data=[go.Bar(x=gks, y=gks_kld,hovertext=full_name)])
# Customize aspect
fig1.update_traces(marker_color='rgb(200,202,100)', marker_line_color='rgb(8,48,107)',
                  marker_line_width=1.5, opacity=0.6)
fig1.update_layout(title_text='KL Divergence among top Keepers (entire career in all comps, excluding shootouts)')
fig1.show()

Two identical distributions produce a KL divergence of 0, so the more similar two distributions are, the closer the KL divergence is to 0. Thus, we can infer that:

  1. The likes of Edwin van der Sar, Petr Cech, Ederson, and De Gea are pretty much average PK savers.
  2. Peter Schmeichel's higher KL divergence doesn't imply that he's better than the keepers above; it means he's worse than the average keeper at saving penalty kicks (we can infer this from the previous visualization, where his 8 percent PK save rate is the lowest of the 12 keepers here).
  3. Buffon, Costa, and Neuer are great at saving PKs, but not as great as Alisson!

(The subtle assumption here is that all penalty kicks are equally difficult, regardless of the competition, whether the keeper's team is winning or losing when the penalty is given away, how good a penalty taker the keeper is facing, and so on.)
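
Because KL divergence is never negative, the chart alone cannot show whether a keeper sits above or below the baseline (the Schmeichel case in point 2). One possible workaround, sketched below, is to attach a sign based on the save rate. The `signed_kl` helper and the `BASELINE_SAVE_RATE` constant are hypothetical names introduced for illustration, not part of the notebook.

```python
import numpy as np

BASELINE_SAVE_RATE = 0.15  # the notebook's assumed average save rate


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions (natural log)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))


def signed_kl(save_rate, baseline=BASELINE_SAVE_RATE):
    """Hypothetical helper: KL distance from the baseline, negated when
    the keeper saves fewer penalties than the baseline rate."""
    p = [1 - baseline, baseline]
    q = [1 - save_rate, save_rate]
    sign = 1.0 if save_rate >= baseline else -1.0
    return sign * kl_divergence(p, q)


# Save rates copied from the notebook's pk_save_rate list
schmeichel = signed_kl(3 / 37)
alisson = signed_kl(14 / 33)
print(f"Schmeichel: {schmeichel:+.4f}, Alisson: {alisson:+.4f}")
```

With this convention, below-average savers plot as negative bars and above-average savers as positive ones, while the magnitude is still the KL divergence from the 85/15 baseline.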

However, this analysis is based on non-shootout PKs, meaning that it excludes some historic moments such as:

  1. Edwin van der Sar's three penalty saves in the Community Shield (2007) against Chelsea, and the other two in the Champions League final (2008), also against Chelsea
  2. Petr Cech single-handedly securing Chelsea's first Champions League title, against Bayern Munich in 2012, by denying Olic and Schweinsteiger.
  3. De Gea going zero for 11 (0/11) against Villarreal in the Europa League final shootout (2020/2021).
  4. Neuer denying Kaka and Ronaldo and, with the help of Ramos sending his penalty to the moon, beating Real Madrid in the Champions League (2011/2012)

(I'm having a hard time finding public data on shootouts, so I will expand on this as I manually collect the relevant data.)

Now let's look into more advanced stats that may tell us more about where De Gea stands amongst other great players.

I will be looking into:
  1. Crosses_stp%: Percentage of crosses stopped
  2. Post-Shot xG (PSxG) Prevention per 90: PSxG is the number of goals an average keeper would be expected to concede given the quality of the shots faced. Subtracting the goals actually conceded from PSxG therefore gauges how well a keeper does at shot-stopping compared to the average keeper.
  3. Defensive Actions Outside of Penalty Area per 90 minutes (#OPA/90)
  4. Average Distance (AvgDist): Average distance from goal when performing defensive actions, i.e. sweeping

along with passing stats, i.e. completion rate for:

  1. Passes between 15 and 30 yards
  2. Passes longer than 30 yards
  3. Passes longer than 40 yards
  4. All Passes

Data source: FBref (e.g. link). Note that these stats are available only for current, non-retired players, so Peter Schmeichel, van der Sar, Cech, Casillas, and Buffon are left out of this part of the analysis.
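
To make the PSxG definition above concrete, here is a minimal sketch of the per-90 calculation as described in the text. The function name and the input numbers are made up for illustration and are not taken from FBref.

```python
def psxg_prevention_per_90(psxg, goals_against, minutes):
    """PSxG minus goals actually conceded, per 90 minutes played."""
    if minutes <= 0:
        raise ValueError("minutes must be positive")
    return (psxg - goals_against) / (minutes / 90)


# Illustrative season: 40.0 PSxG faced, 36 conceded, 3420 minutes (38 full games)
example = psxg_prevention_per_90(psxg=40.0, goals_against=36.0, minutes=3420)
print(round(example, 3))
```

A positive value means the keeper conceded fewer goals than the shots faced would suggest; a negative value means more.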

In [8]:
categories1 = ["Crosses_stp%","PSxG Prevention per 90","OPA/90","AvgDist","pass (15~30 yards)","pass (30 yards <)","pass (40 yards <)","Total Passing"]
categories = [*categories1, categories1[0]]
In [9]:
adv_stats = pd.DataFrame(np.array([[3.2,.05,2.93,23.3,98.4,66.3,48.3,86.7], [4.8,.12,2.28,18.7,98.5,61.6,44.3,85.2],[6.5,0,1.5,17.9,98.6,61.1,43.6,86.5],[6,.09,.96,14.8,98.9,50.7,32.6,80.6],[2.1,.04,.71,14.3,97.8,45.3,36.6, 70.6],[7.4,.09,1.22,16.6,98.7,53.6,40.9,77.5],
                                   [6.4,0.06,1.21,16.2,98.4,54.7,41.3,81.0]]),
                   columns=categories1,index=full_name[5:])
In [10]:
scaler = MinMaxScaler()
for col in adv_stats.columns:
    adv_stats[[col]] = scaler.fit_transform(adv_stats[[col]])
In [11]:
adv_stats
Out[11]:
                  Crosses_stp%  PSxG Prevention per 90    OPA/90   AvgDist  pass (15~30 yards)  pass (30 yards <)  pass (40 yards <)  Total Passing
Manuel Neuer          0.207547                0.416667  1.000000  1.000000            0.545455           1.000000           1.000000       1.000000
Alisson Becker        0.509434                1.000000  0.707207  0.488889            0.636364           0.776190           0.745223       0.906832
Ederson               0.830189                0.000000  0.355856  0.400000            0.727273           0.752381           0.700637       0.987578
Thibaut Courtois      0.735849                0.750000  0.112613  0.055556            1.000000           0.257143           0.000000       0.621118
David De Gea          0.000000                0.333333  0.000000  0.000000            0.000000           0.000000           0.254777       0.000000
Diogo Costa           1.000000                0.750000  0.229730  0.255556            0.818182           0.395238           0.528662       0.428571
Andre Onana           0.811321                0.500000  0.225225  0.211111            0.545455           0.447619           0.554140       0.645963
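As a sanity check on the table above, the scaling that `MinMaxScaler` applies by default can be reproduced by hand. The sketch below does this for the `Crosses_stp%` column only, using the raw values from the notebook in the same row order.

```python
import numpy as np

# Raw Crosses_stp% values from the notebook, in index order (Neuer ... Onana)
crosses = np.array([3.2, 4.8, 6.5, 6.0, 2.1, 7.4, 6.4])

# Min-max scaling: (x - min) / (max - min), so the lowest value maps to 0
# and the highest to 1
scaled = (crosses - crosses.min()) / (crosses.max() - crosses.min())
print(np.round(scaled, 6))
```

Note that a 0 here only means "lowest of these seven keepers", not a poor absolute value.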
In [12]:
# repeat the first stat as a final column so each radar polygon closes
adv_stats['close_form'] = adv_stats['Crosses_stp%']
In [13]:
fig2 = go.Figure()

opacity = .85

# Comparison keepers first (semi-transparent); De Gea is added last so his
# trace is drawn on top
others = ['Manuel Neuer', 'Alisson Becker', 'Ederson', 'Thibaut Courtois',
          'Diogo Costa', 'Andre Onana']

for name in others:
    fig2.add_trace(go.Scatterpolar(
        r=adv_stats.loc[name].values,
        theta=categories,
        fill='toself',
        opacity=opacity,
        name=name,
        marker_line_width=1.5,
        hovertext=name
    ))

fig2.add_trace(go.Scatterpolar(
    r=adv_stats.loc['David De Gea'].values,
    theta=categories,
    fill='toself',
    name='David De Gea',
    marker_line_width=1.5,
    hovertext='David De Gea'
))

# Hide the radial axis: the values are min-max scaled, so only shape matters
fig2.update_layout(
    polar=dict(radialaxis=dict(visible=False)),
    showlegend=True
)

fig2.update_layout(title_text='Comparison of (Normalized) Advanced Stats among Modern Keepers in their respective domestic leagues (2017~)',height=600)

fig2.show()

We can see that De Gea is lacking in many areas, ranking dead last among these seven keepers in six of the eight stats; the exceptions are PSxG Prevention per 90 and pass (40 yards <).

Now let's look at the 2022-2023 season across the Big 5 European Leagues

In [14]:
%pip install lxml
In [15]:
df2223 = pd.read_html('https://fbref.com/en/comps/Big5/keepersadv/players/Big-5-European-Leagues-Stats')
In [16]:
df = df2223[0]
In [17]:
cols = df.columns
In [18]:
df = df[[cols[1],cols[17],cols[20],cols[29],cols[32],cols[33]]]
In [19]:
df.head()
Out[19]:
   Unnamed: 1_level_0  Expected  Launched  Crosses  Sweeper
               Player       /90      Cmp%      Stp  #OPA/90  AvgDist
0       Álvaro Aceves     +0.90      50.0        0    13.85     33.0
1  Julen Agirrezabala     -0.06      36.6       11     1.33     15.4
2       Doğan Alemdar     -0.32      34.3        5     1.11     14.8
3             Alisson     +0.27      41.0       23     2.41     19.8
4     Alphonse Areola     +0.09      37.8        2     0.29     10.3
In [20]:
df = df.dropna()
In [21]:
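# Note: the selected Crosses column is 'Stp' (a count of crosses stopped),
# so 'cross_stop_%' holds a raw count despite its name
df.columns = ['player','PSx90PrevPer90','40+_completion_%','cross_stop_%',"sweeper_action_90","sweep_avg_dist"]
df.head()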
Out[21]:
player PSx90PrevPer90 40+_completion_% cross_stop_% sweeper_action_90 sweep_avg_dist
0 Álvaro Aceves +0.90 50.0 0 13.85 33.0
1 Julen Agirrezabala -0.06 36.6 11 1.33 15.4
2 Doğan Alemdar -0.32 34.3 5 1.11 14.8
3 Alisson +0.27 41.0 23 2.41 19.8
4 Alphonse Areola +0.09 37.8 2 0.29 10.3
In [22]:
df = df[~df['PSx90PrevPer90'].isin(['/90'])]
In [23]:
df[['PSx90PrevPer90','40+_completion_%', 'cross_stop_%',
       'sweeper_action_90', 'sweep_avg_dist']] = df[['PSx90PrevPer90','40+_completion_%', 'cross_stop_%',
       'sweeper_action_90', 'sweep_avg_dist']].apply(pd.to_numeric)
In [24]:
df.head(195)
Out[24]:
player PSx90PrevPer90 40+_completion_% cross_stop_% sweeper_action_90 sweep_avg_dist
0 Álvaro Aceves 0.90 50.0 0 13.85 33.0
1 Julen Agirrezabala -0.06 36.6 11 1.33 15.4
2 Doğan Alemdar -0.32 34.3 5 1.11 14.8
3 Alisson 0.27 41.0 23 2.41 19.8
4 Alphonse Areola 0.09 37.8 2 0.29 10.3
... ... ... ... ... ... ...
206 Guglielmo Vicario 0.09 30.3 34 0.71 11.5
208 Iván Villar -0.09 33.9 8 0.74 14.2
209 Danny Ward -0.21 31.0 20 1.62 15.9
210 Axel Werner -0.82 20.0 1 0.50 14.0
211 Joseph Whitworth -1.33 29.6 1 1.50 15.4

195 rows × 6 columns
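
The cleaning steps above (dropping the header rows that repeat inside the scraped table, then converting text to numbers) can be sketched on a toy table. The frame below is a made-up stand-in, and `errors="coerce"` is an alternative to the notebook's explicit `isin(['/90'])` filter rather than what the notebook does.

```python
import pandas as pd

# Toy stand-in for the scraped table: every value is text, and the header
# repeats as a data row part-way through (values are illustrative)
raw = pd.DataFrame({
    "player": ["Keeper A", "Player", "Keeper B"],
    "psxg_per90": ["+0.27", "/90", "-0.06"],
    "avg_dist": ["19.8", "AvgDist", "15.4"],
})

numeric_cols = ["psxg_per90", "avg_dist"]
clean = raw.copy()
# errors="coerce" turns anything non-numeric (the repeated header) into NaN
clean[numeric_cols] = clean[numeric_cols].apply(pd.to_numeric, errors="coerce")
# Drop the rows that failed to parse and renumber the index
clean = clean.dropna(subset=numeric_cols).reset_index(drop=True)
print(clean)
```

The trade-off is that coercion silently drops any row that fails to parse, so it is worth checking the row count before and after.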

In [25]:
features = ['PSx90PrevPer90','40+_completion_%', 'cross_stop_%',
       'sweeper_action_90', 'sweep_avg_dist']
X = df[features]
In [26]:
names = df['player'].values
In [27]:
max_clusters = 20
ks = range(2, max_clusters+1)
clusterers = [KMeans(n_clusters=k, n_init=50, random_state=109).fit(X) for k in ks] 
In [28]:
modern_gks =['Alisson','Ederson','Diogo Costa','David de Gea', 'André Onana','Kepa Arrizabalaga',
             'Mike Maignan','Jordan Pickford','Nick Pope','Jason Steele',
             'Thibaut Courtois','Dean Henderson','Hugo Lloris','Robert Sánchez','Danny Ward',
             'Keylor Navas','Unai Simón','Gianluigi Donnarumma','Jan Oblak','Rui Patrício',
            'Aaron Ramsdale','José Sá','Neto','Illan Meslier','Emiliano Martínez','Bernd Leno','Vicente Guaita',
            'Gavin Bazunu','Łukasz Fabiański','Fraser Forster','Alex McCarthy','Daniel Iversen','Mark Travers',
            'Marc-André ter Stegen','Yann Sommer']
In [29]:
pass_df = pd.read_html('https://fbref.com/en/comps/Big5/passing/players/Big-5-European-Leagues-Stats')
passdf = pass_df[0]
In [30]:
passdf.columns
Out[30]:
MultiIndex([( 'Unnamed: 0_level_0',      'Rk'),
            ( 'Unnamed: 1_level_0',  'Player'),
            ( 'Unnamed: 2_level_0',  'Nation'),
            ( 'Unnamed: 3_level_0',     'Pos'),
            ( 'Unnamed: 4_level_0',   'Squad'),
            ( 'Unnamed: 5_level_0',    'Comp'),
            ( 'Unnamed: 6_level_0',     'Age'),
            ( 'Unnamed: 7_level_0',    'Born'),
            ( 'Unnamed: 8_level_0',     '90s'),
            (              'Total',     'Cmp'),
            (              'Total',     'Att'),
            (              'Total',    'Cmp%'),
            (              'Total', 'TotDist'),
            (              'Total', 'PrgDist'),
            (              'Short',     'Cmp'),
            (              'Short',     'Att'),
            (              'Short',    'Cmp%'),
            (             'Medium',     'Cmp'),
            (             'Medium',     'Att'),
            (             'Medium',    'Cmp%'),
            (               'Long',     'Cmp'),
            (               'Long',     'Att'),
            (               'Long',    'Cmp%'),
            ('Unnamed: 23_level_0',     'Ast'),
            ('Unnamed: 24_level_0',     'xAG'),
            ('Unnamed: 25_level_0',      'xA'),
            ('Unnamed: 26_level_0',   'A-xAG'),
            ('Unnamed: 27_level_0',      'KP'),
            ('Unnamed: 28_level_0',     '1/3'),
            ('Unnamed: 29_level_0',     'PPA'),
            ('Unnamed: 30_level_0',   'CrsPA'),
            ('Unnamed: 31_level_0',    'PrgP'),
            ('Unnamed: 32_level_0', 'Matches')],
           )
In [31]:
passdf = passdf[[( 'Unnamed: 1_level_0',  'Player'), ('Unnamed: 3_level_0','Pos'),('Medium','Cmp%'),('Long','Cmp%'),('Unnamed: 28_level_0',     '1/3')]]
In [32]:
passdf.head()
Out[32]:
   Unnamed: 1_level_0  Unnamed: 3_level_0  Medium  Long  Unnamed: 28_level_0
               Player                 Pos    Cmp%  Cmp%                  1/3
0    Brenden Aaronson               MF,FW    76.9  38.5                   47
1     Paxten Aaronson               MF,DF    60.9  16.7                    3
2      James Abankwah                  DF    75.0  40.0                    0
3       George Abbott                  MF     NaN   NaN                    0
4    Yunis Abdelhamid                  DF    90.1  55.6                  155
In [33]:
passdf.columns = ['player','pos','med_completion_rate','long_completion_rate','final_third']
In [34]:
passdf = passdf[passdf['pos'] == 'GK']
In [35]:
passdf = passdf.dropna()
In [36]:
passdf[['med_completion_rate',
       'long_completion_rate', 'final_third']] = passdf[['med_completion_rate',
       'long_completion_rate', 'final_third']].apply(pd.to_numeric)
In [37]:
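# Inner join on player name only; a keeper with more than one row in both
# tables (e.g. after a mid-season transfer) is duplicated, which likely
# explains why the 195 keeper rows grow to 210 after the merge
tdf = pd.merge(df, passdf, on="player")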
In [38]:
tdf.head()
Out[38]:
player PSx90PrevPer90 40+_completion_% cross_stop_% sweeper_action_90 sweep_avg_dist pos med_completion_rate long_completion_rate final_third
0 Álvaro Aceves 0.90 50.0 0 13.85 33.0 GK 100.0 50.0 0
1 Julen Agirrezabala -0.06 36.6 11 1.33 15.4 GK 96.8 43.2 2
2 Doğan Alemdar -0.32 34.3 5 1.11 14.8 GK 100.0 36.4 2
3 Alisson 0.27 41.0 23 2.41 19.8 GK 98.6 58.2 16
4 Alphonse Areola 0.09 37.8 2 0.29 10.3 GK 100.0 44.2 0
In [39]:
categories = ['PSx90PrevPer90', '40+_completion_%', 'cross_stop_%',
       'sweeper_action_90', 'sweep_avg_dist',
       'med_completion_rate', 'long_completion_rate', 'final_third']
In [40]:
for cat in categories:
    tdf[[cat]] = scaler.fit_transform(tdf[[cat]])
In [41]:
tdf.shape
Out[41]:
(210, 10)
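The merged frame has 210 rows even though the keeper table had 195, and some names later appear several times in the cluster listings (e.g. Pierluigi Gollini). A likely cause is that a keeper with two rows in each table yields 2 × 2 rows when joined on name alone. The sketch below reproduces the effect on toy data and shows one possible guard; it assumes a squad column is kept in both tables, which the notebook does not do.

```python
import pandas as pd

# Toy data: "Keeper B" changed clubs mid-season, so he has two rows per table
keeping = pd.DataFrame({
    "player": ["Keeper A", "Keeper B", "Keeper B"],
    "squad": ["Club 1", "Club 2", "Club 3"],
    "sweeper_action_90": [1.2, 0.8, 1.0],
})
passing = pd.DataFrame({
    "player": ["Keeper A", "Keeper B", "Keeper B"],
    "squad": ["Club 1", "Club 2", "Club 3"],
    "long_completion_rate": [55.0, 40.0, 45.0],
})

# Joining on name alone pairs every Keeper B row with every other: 1 + 2*2 rows
by_name = pd.merge(keeping, passing, on="player")

# Joining on name and squad keeps one row per spell; validate raises if the
# keys turn out not to be unique on either side
by_name_and_squad = pd.merge(keeping, passing, on=["player", "squad"],
                             validate="one_to_one")

print(len(by_name), len(by_name_and_squad))
```

Duplicated rows matter for the clustering that follows, because each duplicate counts as an extra keeper when the centroids are computed.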
In [42]:
tdf.columns
Out[42]:
Index(['player', 'PSx90PrevPer90', '40+_completion_%', 'cross_stop_%',
       'sweeper_action_90', 'sweep_avg_dist', 'pos', 'med_completion_rate',
       'long_completion_rate', 'final_third'],
      dtype='object')
In [43]:
total_features = ['PSx90PrevPer90', '40+_completion_%', 'cross_stop_%',
       'sweeper_action_90', 'sweep_avg_dist',
       'med_completion_rate', 'long_completion_rate', 'final_third']
In [44]:
X = tdf[total_features]

clusterers = [KMeans(n_clusters=k, n_init=50, random_state=109).fit(X) for k in ks] 
In [45]:
inertias = [c.inertia_ for c in clusterers]
plt.plot(ks, inertias, 'o-')
plt.xticks(ks[::2])
plt.xlabel('$k$')
plt.ylabel('inertia');
plt.suptitle('KMeans Clustering of Keepers')
plt.title('Inertia vs Number of Clusters');
[Figure: inertia vs. number of clusters]
In [46]:
sil_scores = [silhouette_score(X, c.labels_) for c in clusterers]
plt.plot(ks, sil_scores, 'o-')
plt.xticks(ks[::2])
plt.xlabel('$k$')
plt.ylabel('silhouette score')
plt.suptitle('KMeans Clustering of Keepers')
plt.title('Silhouette vs Number of Clusters');
[Figure: silhouette score vs. number of clusters]
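The notebook goes on to set `best_k = 3` by hand after inspecting these two plots. As a rough sketch of how the silhouette curve could drive that choice programmatically, the example below picks the k with the highest silhouette score. It uses synthetic blobs (`X_demo`) rather than the keeper data, so it only illustrates the mechanics.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the scaled feature matrix: three well-separated blobs
X_demo, _ = make_blobs(n_samples=150, centers=[[0, 0], [10, 10], [-10, 10]],
                       cluster_std=1.0, random_state=109)

ks = range(2, 8)
sil_scores = []
for k in ks:
    labels = KMeans(n_clusters=k, n_init=10, random_state=109).fit_predict(X_demo)
    sil_scores.append(silhouette_score(X_demo, labels))

# Pick the k whose clustering has the highest mean silhouette score
best_k = ks[int(np.argmax(sil_scores))]
print(best_k)
```

On real data the peak is often much less clear-cut than on blobs like these, so it is reasonable to weigh the silhouette curve against the inertia elbow and against how interpretable the resulting groups are.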
In [47]:
best_k = 3
kmeans = KMeans(n_clusters=best_k, n_init=50, random_state=109).fit(X.values)
labels = kmeans.labels_
In [48]:
pca = PCA(n_components=2).fit(X)
# project data onto 2D space spanned by components
X_pca = pca.transform(X)
X_pca.shape
Out[48]:
(210, 2)
In [49]:
names = tdf['player'].values
In [50]:
plt.figure(figsize=(10,10))

# data points colored by cluster labels
plt.scatter(X_pca[:, 0], X_pca[:, 1], alpha=.85,c=plt.cm.Accent(labels))

# annotate only the well-known keepers listed in modern_gks
for i in range(X_pca.shape[0]):
    name = names[i]
    if name in modern_gks:
        a = plt.annotate(names[i], (X_pca[i]),size=5)

plt.xlabel(f'PCA1 ({pca.explained_variance_ratio_[0]:.2%} var explained)')
plt.ylabel(f'PCA2 ({pca.explained_variance_ratio_[1]:.2%} var explained)');
plt.title('KMeans Clustering of GKs in the Big 5 European Leagues (PCA Projection)');
[Figure: KMeans clusters of Big 5 goalkeepers, PCA projection]

With the passing data included, I'm not sure how I feel about De Gea's company! Takeaways:

  1. Based on this analysis, Dean Henderson might not be the upgrade on De Gea that some people think he could be. He doesn't necessarily bring a different playing style, either.
  2. Again, this clustering scheme gives some idea of how different De Gea's profile is from those of Alisson, Ederson, Onana, Unai Simón, and other more proactive keepers.
  3. Many top teams seem to have their goalkeepers in the orange and purple groups, not the one De Gea belongs to.

A look inside the cluster groups

In [51]:
tdf['group'] = labels
In [52]:
tdf.describe()
Out[52]:
PSx90PrevPer90 40+_completion_% cross_stop_% sweeper_action_90 sweep_avg_dist med_completion_rate long_completion_rate final_third group
count 210.000000 210.00000 210.000000 210.000000 210.000000 210.000000 210.000000 210.000000 210.000000
mean 0.571326 0.36829 0.240714 0.082898 0.363228 0.855810 0.463881 0.083824 1.090476
std 0.109178 0.09896 0.213275 0.079246 0.124144 0.154953 0.162471 0.122894 0.872989
min 0.000000 0.00000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 0.530997 0.31825 0.037500 0.050181 0.298305 0.802000 0.350417 0.007353 0.000000
50% 0.582210 0.35950 0.208333 0.072202 0.362712 0.880000 0.458333 0.044118 1.000000
75% 0.619272 0.41275 0.366667 0.102527 0.430508 0.944000 0.557083 0.117647 2.000000
max 1.000000 1.00000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 2.000000
In [53]:
tdf['group'].value_counts()
Out[53]:
group
2    90
0    71
1    49
Name: count, dtype: int64
In [54]:
tdf['close'] = tdf['PSx90PrevPer90']
In [55]:
groups = {}
for i in range(3):
    groups[f"group{i+1}"] = tdf[tdf['group'] == i]
In [56]:
group1 = groups['group1'].describe()
group2 = groups['group2'].describe()
group3 = groups['group3'].describe()
# group4 = groups['group4'].describe()
In [57]:
group1.head(8)
Out[57]:
PSx90PrevPer90 40+_completion_% cross_stop_% sweeper_action_90 sweep_avg_dist med_completion_rate long_completion_rate final_third group close
count 71.000000 71.000000 71.000000 71.000000 71.000000 71.000000 71.000000 71.000000 71.0 71.000000
mean 0.583311 0.364930 0.487793 0.086307 0.373168 0.847662 0.439343 0.173364 0.0 0.583311
std 0.045630 0.053626 0.149708 0.039546 0.086193 0.078167 0.110636 0.169668 0.0 0.045630
min 0.450135 0.236000 0.233333 0.021661 0.169492 0.552000 0.183333 0.000000 0.0 0.450135
25% 0.552561 0.323000 0.366667 0.055596 0.313559 0.800000 0.356667 0.077206 0.0 0.552561
50% 0.590296 0.366000 0.483333 0.082310 0.386441 0.856000 0.445000 0.117647 0.0 0.590296
75% 0.609164 0.402000 0.583333 0.107942 0.430508 0.896000 0.508333 0.205882 0.0 0.609164
max 0.671159 0.471000 1.000000 0.257040 0.633898 1.000000 0.673333 1.000000 0.0 0.671159
In [58]:
group2.head(8)
Out[58]:
PSx90PrevPer90 40+_completion_% cross_stop_% sweeper_action_90 sweep_avg_dist med_completion_rate long_completion_rate final_third group close
count 49.000000 49.000000 49.000000 49.000000 49.000000 49.000000 49.000000 49.000000 49.0 49.000000
mean 0.568458 0.455571 0.106122 0.100995 0.415842 0.917224 0.663095 0.037065 1.0 0.568458
std 0.141109 0.113240 0.109592 0.141600 0.178160 0.081593 0.118357 0.045795 0.0 0.141109
min 0.231806 0.297000 0.000000 0.000000 0.016949 0.728000 0.430000 0.000000 1.0 0.231806
25% 0.528302 0.394000 0.016667 0.051264 0.349153 0.872000 0.578333 0.007353 1.0 0.528302
50% 0.563342 0.445000 0.050000 0.085921 0.433898 0.920000 0.646667 0.014706 1.0 0.563342
75% 0.622642 0.500000 0.200000 0.120578 0.505085 1.000000 0.736667 0.051471 1.0 0.622642
max 0.878706 1.000000 0.350000 1.000000 1.000000 1.000000 1.000000 0.161765 1.0 0.878706
In [59]:
group3.head(8)
Out[59]:
PSx90PrevPer90 40+_completion_% cross_stop_% sweeper_action_90 sweep_avg_dist med_completion_rate long_completion_rate final_third group close
count 90.000000 90.000000 90.000000 90.000000 90.000000 90.000000 90.000000 90.000000 90.0 90.000000
mean 0.563432 0.323422 0.119074 0.070357 0.326742 0.828800 0.374778 0.038644 2.0 0.563432
std 0.124230 0.087066 0.095254 0.048529 0.101392 0.212416 0.120195 0.045052 0.0 0.124230
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 2.0 0.000000
25% 0.500000 0.291500 0.033333 0.041155 0.283051 0.786000 0.315417 0.007353 2.0 0.500000
50% 0.575472 0.333000 0.091667 0.071480 0.340678 0.884000 0.381667 0.022059 2.0 0.575472
75% 0.625337 0.360000 0.216667 0.090794 0.383051 0.990000 0.462500 0.044118 2.0 0.625337
max 1.000000 0.615000 0.300000 0.361011 0.603390 1.000000 0.618333 0.235294 2.0 1.000000
In [60]:
group1.columns
Out[60]:
Index(['PSx90PrevPer90', '40+_completion_%', 'cross_stop_%',
       'sweeper_action_90', 'sweep_avg_dist', 'med_completion_rate',
       'long_completion_rate', 'final_third', 'group', 'close'],
      dtype='object')
In [61]:
feats = ['PSx90PrevPer90', '40+_completion_%', 'cross_stop_%',
       'sweeper_action_90', 'sweep_avg_dist', 'med_completion_rate',
       'long_completion_rate', 'final_third', 'PSx90PrevPer90']
In [62]:
tdf[tdf['player']=='David de Gea']
Out[62]:
player PSx90PrevPer90 40+_completion_% cross_stop_% sweeper_action_90 sweep_avg_dist pos med_completion_rate long_completion_rate final_third group close
60 David de Gea 0.584906 0.314 0.25 0.06065 0.383051 GK 0.84 0.393333 0.051471 2 0.584906

De Gea is in Group 3; let's see how his group compares to the others

In [63]:
fig4 = go.Figure()


fig4.add_trace(go.Scatterpolar(
      r = group1[feats].loc['mean'].values,
      theta = feats,
      fill='toself',
      name='Group 1',
     marker_line_width=1.5,
    hovertext ='Group 1'
))
fig4.add_trace(go.Scatterpolar(
      r=group2[feats].loc['mean'].values,
      theta=feats,
      # fill='toself',
      name='Group 2',
     marker_line_width=.15,
    hovertext='Group 2'
))
fig4.add_trace(go.Scatterpolar(
      r=group3[feats].loc['mean'].values,
      theta=feats,
      # fill='toself',
      name='Group 3',
     marker_line_width=1.5,
    hovertext = 'Group 3'
))
# fig4.add_trace(go.Scatterpolar(
#       r=group4[feats].loc['mean'].values,
#       theta=feats,
#       # fill='toself',
#       name='Group 4',
#      marker_line_width=1.5,
#     hovertext = 'Group 4'
# ))


fig4.update_layout(
  polar=dict(
    radialaxis=dict(
      visible=False
    )),
  showlegend=True
)

fig4.update_layout(title_text='Comparison of (Normalized) Average of Advanced Stats among Profile Groups (Big 5 European Leagues, 2022-2023)',height=600)

fig4.show()
In [64]:
groups['group1']['player'].values
Out[64]:
array(['Alisson', 'Emil Audero', 'Édgar Badía', 'Oliver Baumann',
       'Gavin Bazunu', 'Paul Bernardoni', 'Marco Bizot', 'Janis Blaswich',
       'Yassine Bounou', 'Marco Carnesecchi', 'Koen Casteels',
       'Lucas Chevalier', 'Oliver Christensen', 'Andrea Consigli',
       'Thibaut Courtois', 'Mory Diaw', 'Stole Dimitrievski',
       'Yehvann Diouf', 'Gianluigi Donnarumma', 'Maxime Dupé',
       'Łukasz Fabiański', 'Wladimiro Falcone', 'Aitor Fernández',
       'Fernando', 'Mark Flekken', 'Gauthier Gallon', 'Paulo Gazzaniga',
       'Rafał Gikiewicz', 'Ivo Grbić', 'Vicente Guaita', 'Dean Henderson',
       'Sergio Herrera', 'Lukáš Hrádecký', 'Alban Lafont',
       'Jeremías Ledesma', 'Bernd Leno', 'Benjamin Leroy', 'Hugo Lloris',
       'Anthony Lopes', 'Pau López', 'Giorgi Mamardashvili',
       'Steve Mandanda', 'Emiliano Martínez', 'Illan Meslier',
       'Vanja Milinković-Savić', 'Lorenzo Montipò', 'Yvon Mvogo', 'Neto',
       'Alexander Nübel', 'Jan Oblak', 'Jiří Pavlenka', 'Jordan Pickford',
       'Nick Pope', 'Predrag Rajković', 'Aaron Ramsdale', 'David Raya',
       'Álex Remiro', 'Manuel Riemann', 'Frederik Rønnow', 'José Sá',
       'Brice Samba', 'Kasper Schmeichel', 'Marvin Schwäbe', 'Matz Sels',
       'Rui Silva', 'Unai Simón', 'Łukasz Skorupski', 'David Soria',
       'Guglielmo Vicario', 'Danny Ward', 'Robin Zentner'], dtype=object)
In [65]:
groups['group2']['player'].values
Out[65]:
array(['Álvaro Aceves', 'Kepa Arrizabalaga', 'Fabian Bredlow',
       'Juan Carlos', 'Michele Cerofolini', 'Alessio Cragno',
       'Rémy Descamps', 'Martin Dúbravka', 'Ederson', 'Álvaro Fernández',
       'Joan García', 'Pierluigi Gollini', 'Pierluigi Gollini',
       'Pierluigi Gollini', 'Pierluigi Gollini', 'Dominik Greif',
       'Péter Gulácsi', 'Samir Handanović', 'Caoimhín Kelleher',
       'Gregor Kobel', 'Jean-Louis Leca', 'Benjamin Lecomte',
       'Andriy Lunin', 'Mike Maignan', 'Federico Marchetti',
       'Diego Mariño', 'Alex Meret', 'Alexander Meyer', 'Florian Müller',
       'Manuel Neuer', 'André Onana', 'Stefan Ortega', 'Fernando Pacheco',
       'Rui Patrício', 'Gianluca Pegolo', 'Iñaki Peña', 'Ivan Provedel',
       'Leonardo Román', 'Gerónimo Rulli', 'Mouhamadou Sarr',
       'Salvatore Sirigu', 'Yann Sommer', 'Yann Sommer', 'Jason Steele',
       'Wojciech Szczęsny', 'Ciprian Tătărușanu', 'Marc-André ter Stegen',
       'Pietro Terracciano', 'Michael Zetterer'], dtype=object)
In [68]:
groups['group3']['player'].unique()
Out[68]:
array(['Julen Agirrezabala', 'Doğan Alemdar', 'Alphonse Areola',
       'Sergio Asenjo', 'Asmir Begović', 'Daniel Bentley', 'Rubén Blanco',
       'Joaquín Blázquez', 'Claudio Bravo', 'Marcin Bułka',
       'Matis Carvalho', 'Benoît Costil', 'Finn Dahmen',
       'Michele Di Gregorio', 'Ouparine Djoco', 'Marko Dmitrović',
       'Bartłomiej Drągowski', 'Tjark Ernst', 'Ralf Fährmann',
       'Vincenzo Fiorillo', 'Yahia Fofana', 'Fraser Forster',
       'David de Gea', 'David Gil', 'Lennart Grill', 'Wayne Hennessey',
       'Daniel Iversen', 'Sam Johnstone', 'Filip Jørgensen',
       'Bingourou Kamara', 'Tomáš Koubek', 'Benjamin Lecomte',
       'Donovan Léon', 'Mateusz Lis', 'Diego López', 'Andrey Lunyov',
       'Vito Mannone', 'Agustín Marchesín', 'Jordi Masip',
       'Alex McCarthy', 'Edouard Mendy', 'Juan Musso', 'Keylor Navas',
       'Ørjan Nyland', 'Guillermo Ochoa', 'Jan Olschowsky', 'Robin Olsen',
       'Jonas Omlin', 'Fernando Pacheco', 'Patrick Pentz',
       'Simone Perilli', 'Mattia Perin', 'Samuele Perisan',
       'Ghjuvanni Quilichini', 'Ionuț Radu', 'Diant Ramaj',
       'Nicola Ravaglia', 'Pepe Reina', 'Rémy Riou', 'Joel Robles',
       'Marek Rodák', 'Alessandro Russo', 'Alexander Schwolow',
       'Luigi Sepe', 'Marco Silvestri', 'Tobias Sippel',
       'François-Joseph Sollacaro', 'Yann Sommer', 'Marco Sportiello',
       'Mile Svilar', 'Kevin Trapp', 'Mark Travers', 'Martin Turk',
       'Sven Ulreich', 'Iván Villar', 'Axel Werner', 'Joseph Whitworth',
       'Jeroen Zoet', 'Petar Zovko'], dtype=object)

Based on this clustering analysis, the three groups can be characterized as follows:

  1. Group 1 seems to include proactive keepers who claim the most crosses, make a decent number of defensive actions, and are okay passers, e.g. Alisson, Courtois, Nick Pope, Pickford, Ramsdale, Unai Simón
  2. Group 2 seems to include keepers who are great passers with okay proactive, defensive actions, e.g. Ederson, Jason Steele, ter Stegen, Neuer
  3. Group 3 seems to include keepers who aren't particularly great at anything
  4. The difference is minimal, but Group 3 also has the worst goal prevention based on PSxG.
  5. Group 3 keepers have the worst average for every metric barring crosses stopped and passes into the final third.
  6. De Gea, mind you, belongs to Group 3. Make of that what you will!
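The per-group `describe()` tables above can be condensed into a single comparison. The sketch below is one way to do it, using a small toy frame with illustrative values in place of `tdf`: average each metric by group, then ask which group is lowest on each.

```python
import pandas as pd

# Toy frame standing in for tdf; values are illustrative
toy = pd.DataFrame({
    "group": [0, 0, 1, 1, 2, 2],
    "cross_stop_%": [0.50, 0.46, 0.10, 0.12, 0.11, 0.13],
    "long_completion_rate": [0.45, 0.43, 0.66, 0.68, 0.38, 0.36],
})

# One row per group, one column per metric
group_means = toy.groupby("group").mean()
# For each metric, the label of the group with the lowest average
worst_group = group_means.idxmin()

print(group_means)
print(worst_group)
```

Applied to the real `tdf` with the eight feature columns, the same two lines would give the kind of "worst at everything except..." summary stated in points 4 and 5.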
In [70]:
tdf.head()
Out[70]:
player PSx90PrevPer90 40+_completion_% cross_stop_% sweeper_action_90 sweep_avg_dist pos med_completion_rate long_completion_rate final_third group close
0 Álvaro Aceves 0.832884 0.500 0.000000 1.000000 1.000000 GK 1.000 0.500000 0.000000 1 0.832884
1 Julen Agirrezabala 0.574124 0.366 0.183333 0.096029 0.403390 GK 0.744 0.386667 0.014706 2 0.574124
2 Doğan Alemdar 0.504043 0.343 0.083333 0.080144 0.383051 GK 1.000 0.273333 0.014706 2 0.504043
3 Alisson 0.663073 0.410 0.383333 0.174007 0.552542 GK 0.888 0.636667 0.117647 0 0.663073
4 Alphonse Areola 0.614555 0.378 0.033333 0.020939 0.230508 GK 1.000 0.403333 0.000000 2 0.614555
In [73]:
tdf.columns
Out[73]:
Index(['player', 'PSx90PrevPer90', '40+_completion_%', 'cross_stop_%',
       'sweeper_action_90', 'sweep_avg_dist', 'pos', 'med_completion_rate',
       'long_completion_rate', 'final_third', 'group', 'close'],
      dtype='object')
In [98]:
tdf.loc[tdf['player']=='Aaron Ramsdale'].values[0][1:]
Out[98]:
array([0.5768194070080862, 0.254, 0.36666666666666664, 0.0815884476534296,
       0.42711864406779665, 'GK', 0.8560000000000008, 0.26333333333333325,
       0.22058823529411764, 0, 0.5768194070080862], dtype=object)
In [76]:
import plotly.express as px

Percentile Visualizer

In [128]:
def pv(player):
    cols = ['PSx90PrevPer90', '40+_completion_%', 'cross_stop_%',
       'sweeper_action_90', 'sweep_avg_dist', 'med_completion_rate',
       'long_completion_rate', 'final_third']
    # rows for this player; only the first is used if the name is duplicated
    row = tdf.loc[tdf['player']==player, cols]

    # NOTE: these are the min-max scaled values times 100, i.e. position
    # within the observed range, not true percentile ranks
    pcts = row.values[0]*100
    fig, ax = plt.subplots()

    ax.barh(cols, pcts)

    ax.set_ylabel('Attributes')
    ax.set_title(f'{player} Attribute Percentile')

    # label each bar with its value
    for index, value in enumerate(pcts):
        value = round(value,2)
        ax.text(value, index, str(value))

    plt.show()
In [134]:
tdf['player']
Out[134]:
0           Álvaro Aceves
1      Julen Agirrezabala
2           Doğan Alemdar
3                 Alisson
4         Alphonse Areola
              ...        
205      Joseph Whitworth
206         Robin Zentner
207      Michael Zetterer
208           Jeroen Zoet
209           Petar Zovko
Name: player, Length: 210, dtype: object
In [136]:
pv('Jason Steele')
[Figure: horizontal bar chart of Jason Steele's attribute values]
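One caveat on the chart: `pv` plots the min-max scaled values from `tdf` multiplied by 100, which is a position within the observed range rather than a percentile rank. The sketch below contrasts the two on a toy column (illustrative values, one outlier) using pandas' `rank(pct=True)`. Swapping this into `pv` would be a change to the notebook, not something it currently does.

```python
import pandas as pd

# Toy stand-in for one attribute of tdf; values are illustrative
toy = pd.DataFrame({
    "player": ["A", "B", "C", "D", "E"],
    "long_completion_rate": [30.0, 35.0, 36.0, 38.0, 90.0],
})

col = "long_completion_rate"
# Min-max scaling: position within the observed range, sensitive to outliers
minmax = (toy[col] - toy[col].min()) / (toy[col].max() - toy[col].min()) * 100
# Percentile rank: share of players at or below this value
pct_rank = toy[col].rank(pct=True) * 100

print(pd.DataFrame({"player": toy["player"],
                    "minmax": minmax.round(1),
                    "pct_rank": pct_rank.round(1)}))
```

Player D is second best of five, yet sits near the bottom of the min-max scale because of E's outlier, which is the kind of distortion a true percentile avoids.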